ABSTRACT

The study reported in this paper originated from a 
need to design a short English for Academic Purposes (EAP) 
proficiency test for incoming undergraduate students. The approach to 
solving the problem was to focus on establishing consistent person 
comparison between the students at Hong Kong Baptist College who did 
not meet the minimum entry requirement in the English language and a 
reference group who met the required grade level. The comparison was 
made on the basis of a short English-as-a-Second-Language (ESL)
proficiency test taken by both student groups. The establishment of
equivalence between the reference and student groups was achieved
through the employment of FACETS (Linacre and Wright, 1990). The 
logit level of the reference group (-0.93) can be taken as the 
equivalence of the minimal required ESL level for entry into 
university. Findings suggest the possible extensions of the Rasch 
model in terms of both item calibration and person measurement 
through the employment of FACETS. The EAP Test is appended. (Contains 
19 references.) (JP) 
Taking a multi-faceted view of the uni-dimensional measurement from Rasch
analysis in language tests



Tony Lee 



I. Introduction

The advent of the Item Response Model (IRM) in the field of language testing
(e.g. Henning 1984, Henning et al. 1985, Griffin et al. 1985, Woods & Baker
1985, Pollitt & Hutchinson 1987 and Choi & Bachman 1992) has been a most
important development in the recent history of the discipline. IRM has given
language testing a rigorous basis for measurement. The catch, though, is that
IRM is primarily a measurement model with few or no immediate implications for
language testing research. Specifically, the uni-dimensionality assumption in
IRM has been an initial stumbling block for many language testers. It is
argued that, if language is inherently complex, forcing the uni-dimensional
condition onto all language data would strait-jacket language testing
research. (See Bachman 1990 and Henning 1992 for an interesting
discussion.)

Conceptual analysis (e.g. Reckase 1979, Henning et al. 1985, 1992, Choi &
Bachman 1992) has helped to define the scope of the uni-dimensionality
assumption and to resolve the apparent dilemma. Research designs encompassing
an IRM component have also been developed, and this has helped to develop the
applied linguistic research dimension of IRM.

II. Rasch model as a research tool 

Wright & Masters (1982) maintain that the uni-dimensionality assumption is a
"universal characteristic of all measurement". This, however, should not in
theory preclude analyses over and above an IRM analysis. Jensen (1978), for
example, warns of "... a flagrant conceptual and scientific blunder ... to
orthogonal rotation of principal components or factors without first
extracting the general factor (i.e. the first principal component or first
principal factor)". Indeed, IRM can easily be conceptualized as a rigorous way
to extract the general factor. The standardized residuals from an IRM analysis
would provide data for further analyses as envisaged by Jensen. Pollitt
(personal communication) suggests using residual analysis to tap specific
dimensions within behavioural data after the latent trait has been extracted.
Lee (1992) analyzes the residuals to establish the construct validity of an
ESL reading test.
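
The standardized residuals referred to above can be illustrated with a minimal sketch. It assumes the dichotomous Rasch model (log-odds of success = ability minus difficulty); the function names and the sample person and item values are hypothetical and are not taken from the study.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def standardized_residual(response, ability, difficulty):
    """(observed - expected) / sqrt(model variance) for a 0/1 response."""
    p = rasch_probability(ability, difficulty)
    return (response - p) / math.sqrt(p * (1.0 - p))

# Hypothetical person at 0.5 logits answering three items of rising difficulty.
person = 0.5
item_difficulties = [-1.0, 0.5, 2.0]
responses = [1, 1, 0]

residuals = [
    standardized_residual(x, person, d)
    for x, d in zip(responses, item_difficulties)
]
print(residuals)
```

A matrix of such residuals (persons by items) is the kind of data that could then be passed to a further analysis once the general dimension has been removed.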

Within IRM itself, multi-faceted Rasch analysis (Linacre 1989a) expands the
one-parameter Rasch model to encompass the analysis of facets in the data.
This has enabled IRM to be employed in diverse research designs, analysis
configurations and data collection schedules.

III. Many-Faceted Rasch Analysis 

Linacre (1989a & b) argues and demonstrates the possibility of extending the
initial one-parameter (or two-facet) Rasch model to n-facet models. This is an
interesting development. Constituents within a complex human behavioural
context can now be accommodated within the same IRM model for analysis.
Typically, facets can include judges of human performance (e.g. in a writing
test), sub-groupings of subjects/candidates, or sub-test item groups. With
the flexibility introduced, research designs can now be developed which
do greater justice to features of human behaviour (e.g. varying severity of
judges, cultural and/or economic background of subjects). In addition, FACETS
(Linacre & Wright 1990), which is the software implementation of multi-faceted
Rasch analysis, can generate interaction analyses of the facets. Lee et al.
(forthcoming) use FACETS to calibrate and to establish the scale structure of
the Australian Second Language Proficiency Ratings (ASLPR) (Ingram & Wylie
1979). Lee (in preparation) examines, via FACETS, ESL program entry level and
test time interaction, and rater and ethnic background interaction in the
National Languages and Literacy Institute of Australia (NLLIA) ESL Bandscales
(MacKay et al. 1992).
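
The step from a two-facet to an n-facet model can be sketched as follows. This is a simplified dichotomous illustration, not the FACETS implementation: it assumes that every facet other than the person enters the log-odds additively with a negative sign, and the function name and values are hypothetical.

```python
import math

def facet_probability(ability, *opposing_facets):
    """Success probability when the log-odds equal the person's ability
    minus the sum of all other facet measures (item difficulty, judge
    severity, and so on)."""
    logit = ability - sum(opposing_facets)
    return 1.0 / (1.0 + math.exp(-logit))

# Two facets: a person (1.0 logits) meets an item (0.5 logits).
two_facet = facet_probability(1.0, 0.5)

# Three facets: the same encounter rated by a judge of severity 0.5 logits.
three_facet = facet_probability(1.0, 0.5, 0.5)

print(two_facet, three_facet)
```

Adding a facet changes only the sum inside the log-odds, which is why further facets can be introduced without altering the form of the model.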

IV. The Study 

A. The Background

The study reported in this paper originated from a need to design a short EAP
proficiency test for incoming mature undergraduate students. A policy decision
of the Hong Kong Baptist College in 1992 to admit former non-degree graduates
into undergraduate degree programs resulted in a situation where the minimum
entry requirement in the English language (Grade D in the Use of English
examinations of the Hong Kong Examinations Authority) was not met by some of
the admittants. It was difficult for these students to re-take the Use of
English examinations and uneconomical to administer a facsimile version. The
Language Centre was given the charge to find a means to establish the
equivalence of the required minimum ESL level for entry into degree programs.
The approach to solving the problem was to focus on establishing consistent
person comparison between the students in question and a reference group with
the required Grade D level in the Use of English examinations. The comparison
was made on the basis of a short ESL proficiency test taken by both groups of
candidates.

B. The EAP Test 

Practical and monetary constraints necessitated the choice of the modified
cloze format based on a single reading passage. Two sets of items were
prepared. The first consisted of 52 proofreading items relating to grammatical
features in the first part of the passage. The second set consisted of 44
gap-filling items relating to cohesion features.

The test was first piloted on a group of undergraduate students covering the
whole range of the Use of English grade levels. The test was then given to 221
mature students. A reference group of students (n = 38) with the required
minimum Use of English grade was also given the test.

C. The research design 

The overall design of the study was to obtain a comparison between the
'student' group and the 'reference' group based on the EAP test. As the
reference group was a sample of those who had achieved the required minimum
English language standard for entry into universities, those in the 'student'
group who matched the level of the 'reference' group in the EAP test had
to be considered as having an equivalent level of English language ability.

V. The Analysis 

To achieve the objectives described above it was necessary to have an ability
scale that was robust and consistent and to make the required comparison of
the two groups of candidates beyond the particular EAP test given. This is a
typical case of sample-free test calibration and test-free person measurement
in IRM. In addition, the analysis had to calibrate the two sub-groups of
candidates (student and reference). Multi-faceted Rasch analysis was thus
necessary. It was also thought relevant to calibrate the two parts of the EAP
test to see if mastery of grammar and of cohesion features was distinguishable
in the EAP test. FACETS (Linacre & Wright 1990) was employed. Four facets were
included in the analysis: the candidates, the two candidate sub-groups
('Student' and 'Reference'), the two sub-tests and the test items.

VI. Results 

A. Uni-dimensionality

An informal test of uni-dimensionality was performed via maximum likelihood
factor analysis. A first factor accounting for 21% of the overall variance was
obtained. This was considered sufficient to make a uni-dimensionality claim
for Rasch analysis (Reckase 1979, Henning et al. 1985).

B. Item Calibration

Table 1 contains detailed item calibration of the EAP test. The leftmost
column contains descriptive statistics: 'Score' is the raw score of the item
across all candidates; 'Count' is the number of score points and 'Average' the
item facility value. The second column contains the item calibration
statistics: the logit and its associated standard error. The third column
contains the fit statistics. FACETS includes two types of fit statistics: the
Infit and the Outfit. The former is an information-weighted mean-square fit
statistic and the latter the conventional mean-square. The expected value is 1
in both. Values greater than 1 would indicate noise in the Infit statistic and
an outlier in the Outfit statistic.

                    Score    Count  Average |  Measure  Model S.E. |  Infit    Std  Outfit    Std
Mean (Count: 96)     88.0    259.0      0.3 |    -0.00        0.20 |    1.0    0.1     1.0    0.1
S.D.                 70.2      0.0      0.3 |     1.71        0.14 |    0.0    0.6     0.1    0.8

RMSE 0.25  Adj S.D. 1.70  Separation 6.86  Reliability 0.98
Fixed (all same) chi-square: 5237.64  d.f.: 93  significance: .00

Table 1: Item Measurement Report (ordered by N)
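
The two fit statistics can be sketched for a single dichotomous item as follows. This is a minimal illustration of the usual mean-square definitions, assuming the dichotomous Rasch model and known person measures; it is not the FACETS estimation routine, and the function names and data are hypothetical.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def item_fit(responses, abilities, difficulty):
    """Return (infit, outfit) mean-squares for one dichotomous item."""
    squared_residuals = []
    variances = []
    for x, b in zip(responses, abilities):
        p = rasch_probability(b, difficulty)
        squared_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))
    # Outfit: plain average of the squared standardized residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    # Infit: squared residuals weighted by their information (model variance).
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit

# Persons exactly on target for the item: both statistics equal 1.
flat_infit, flat_outfit = item_fit([1, 0, 1, 0], [0.0, 0.0, 0.0, 0.0], 0.0)

# One very low-ability person (-3 logits) unexpectedly succeeds.
odd_infit, odd_outfit = item_fit([1, 0, 1, 0, 1], [-3.0, 0.0, 0.0, 0.0, 0.0], 0.0)

print(flat_infit, flat_outfit, odd_infit, odd_outfit)
```

In the second call the single unexpected success inflates the Outfit more than the Infit, which is the sense in which the Outfit is sensitive to outliers.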

The item difficulties range from logit -4.58 (Item 48) to 4.41
(Items 61 & 68). Most of the items are accepted by the model, with
the exception of Item 18 (Infit: 1.0, Outfit: 1.3), Item 42
(Infit: 1.0, Outfit: 1.2) and Item 70 (Infit: 1.0, Outfit: 1.5).
Items 51 and 93 were answered correctly by none of the candidates.
The test thus has ninety-one items accepted by the model with a
fairly wide range of difficulty levels.

FACETS also reports tests of the overall calibration of a facet.
These are found at the bottom of the table. ('RMSE' is the root
mean square standard error; 'Adj S.D.' is the standard deviation of
the estimates after removing measurement error; 'Separation' is a
measure of the relative spread of the estimates; 'Reliability' is
the Rasch equivalent to the KR-20 or Cronbach Alpha statistics.
'Fixed chi-square' is the goodness-of-fit test for the elements'
sharing the same measure after allowing for measurement error.) In
the case of the item calibration, the differences (separation)
among the items are found to be reliably distinct (reliability:
0.98) and the measurement variable established is consistent.
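
The relation between these summary statistics can be checked with a short sketch. It assumes the usual definitions (separation is the adjusted S.D. divided by the RMSE; reliability is the adjusted variance divided by the sum of adjusted and error variance) and uses the rounded values printed under Table 1; the function names are illustrative only.

```python
def separation(adj_sd, rmse):
    """Spread of the estimates expressed in units of measurement error."""
    return adj_sd / rmse

def separation_reliability(adj_sd, rmse):
    """Share of the observed variance that is not measurement error."""
    true_variance = adj_sd ** 2
    return true_variance / (true_variance + rmse ** 2)

# Rounded values reported for the item facet.
item_separation = separation(1.70, 0.25)
item_reliability = separation_reliability(1.70, 0.25)

print(round(item_separation, 2), round(item_reliability, 2))
```

Because the inputs are rounded to two decimals, the separation obtained this way is close to, but not identical with, the 6.86 reported by FACETS, while the reliability rounds to the reported 0.98.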

C. The Sub-tests

Table 2 reports the calibration of the two sections of the test.
Part 2 (cohesion features - logit 0.14) is more difficult than Part
1 (grammar features - logit -0.14). The fit statistics are all
within the acceptability level. The two sections are also reliably
distinct (reliability: 0.97) and the measuring variable consistent.

                    Score    Count  Average |  Measure  Model S.E. |  Infit    Std  Outfit    Std
Part 1               4782    11914      0.4 |    -0.14        0.02 |    1.0      0     1.0      0
Part 2               3666    12950      0.3 |     0.14        0.02 |    1.0            0.9      0
Mean (Count: 2)    4224.0  12432.0      0.3 |     0.00        0.02 |    1.0    0.1     1.0   -0.2
S.D.                558.0    518.0      0.1 |     0.14        0.00 |    0.0    0.3     0.0    0.5

RMSE 0.02  Adj S.D. 0.14  Separation 6.05  Reliability 0.97
Fixed (all same) chi-square: 75.11  d.f.: 1  significance: .00

Table 2: Sub-test Measurement Report

D. The Student Facet

Owing to the large number of candidates it is not practicable to
include a detailed person measurement report in the paper. The
overall range of candidate ability is from logit -1.15 to 1.42.
Table 3 reports the calibration of the two candidate sub-groups:
'Student' and 'Reference'. The 'Reference' group (logit: -0.93)
is calibrated higher than the 'Student' group (logit: -1.16)
with a reliability of separation of 0.92 and a significant overall
measurement fit.

RMSE 0.03  Adj S.D. 0.11  Separation 3.31  Reliability 0.92
Fixed (all same) chi-square: 23.94  d.f.: 1  significance: .00

Table 3: Group Measurement Report (ordered by N)
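
The size of the gap between the two group calibrations can be made concrete with a small sketch. It assumes that the group measures enter the log-odds of success additively, so that a difference in logits translates into an odds ratio; the reference item at 0 logits and the function name are hypothetical, and the two logit values are those reported in the text.

```python
import math

reference_logit = -0.93   # 'Reference' group calibration reported in the text
student_logit = -1.16     # 'Student' group calibration reported in the text

def success_probability(measure, item_difficulty=0.0):
    """Dichotomous Rasch probability for a given measure and item difficulty."""
    return 1.0 / (1.0 + math.exp(-(measure - item_difficulty)))

# Difference between the groups in logits and as a ratio of odds.
gap = reference_logit - student_logit
odds_ratio = math.exp(gap)

# Success probabilities on a hypothetical item of average difficulty.
p_reference = success_probability(reference_logit)
p_student = success_probability(student_logit)

print(round(gap, 2), round(odds_ratio, 2), round(p_reference, 2), round(p_student, 2))
```

Under these assumptions the 'Reference' group's odds of success on any given item are roughly a quarter higher than those of the 'Student' group.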

VII. Discussion

A. The principal research aim of the study, the establishment
of equivalence between the 'Reference' and the 'Student' groups, has
been achieved through the employment of FACETS. The logit level of
the 'Reference' group (-0.93) can be taken as the equivalence of
the minimal required ESL level for entry into university. The
concept of equivalence should be correctly understood. Equivalence
here refers to the two groups of candidates on the basis of the
test administered. It does not refer to the EAP test and the Use of
English examinations. Thus, while the two groups of candidates have
been compared regarding ESL ability, they have not been compared
regarding possible equivalence in the results of the Use of English
examinations.

B. The calibration of the two parts of the test is interesting in
that it enables analysis of groupings of test items. The analysis
reported is in fact a construct validation study as suggested by
Wright & Masters (1982: 93):

"The pattern of item calibration provides a description of the
reach and hierarchy of the variable. This pattern can be
compared with the intentions of the item writers to see if it
confirms their expectations concerning the variable they wanted
to construct. To the extent that it does, it affirms the
construct validity of the variable."

The finding that cohesion features require a more advanced
(difficult) ESL ability to master than grammar seems to confirm
applied linguistic and TESL theory, and the views of many TESL
colleagues.

As an item-oriented technique, Rasch analysis can be used for
item-oriented construct validation (e.g. Lee 1992). As FACETS allows for
facets of item sub-groups to be included in the analysis, construct
validation can also be carried out on item sub-groups. In the study
reported it may not be very instructive to estimate the construct
validity directly from the items. Using the groupings of the items,
as has been done here, would make more sense in terms of both
computation and applied linguistic theory.

VIII. Conclusion and implications

What the study has shown are possible extensions of the Rasch model
in terms of both item calibration and person measurement through
the employment of FACETS. Indeed, the package allows for a maximum
of nine facets to be calibrated simultaneously. Such extensions are
particularly attractive to those colleagues who, while appreciating
the rigour in measurement offered by the Rasch model, would be
apprehensive of the danger of being strait-jacketed in their
applied linguistic research. What has been demonstrated in the
paper is that FACETS is able to maintain the rigour of the Rasch
model and to provide the applied linguist with interesting research
design possibilities. By so doing, FACETS has taken the Rasch
model from being a strictly measurement model to a general research
tool and enables language testers to "devote their creative powers
to designing tests which involve deeper and more relevant evidence
of competence..." (Linacre 1989b: 10).
Appendix: the EAP Test



What (DESOfilai buy today, they throw away tomorrow. But (2) find 
somewhere to put the rubbish is (3) become harder and more 
expensive. America's Environmental Protection Agency (4) estimate
that 80% of the country's landfills will shut (5) in 2010. Japan
looks (6)- running out of usable space by 2005. Holland has more
or less (7) runed out already. Other options are no easier. Most
industrial countries (8) agree two years ago to discourage shipment
of hazardous waste (9) for the third world. No wonder that
(10) rich-countries governments consider waste (11) disposals as
their most (12) pressed environmental problem.
The problem is (13) large manmade. Rarely is there an absolute
shortage (14) in space to put more rubbish dumps. But nobody
(15) want a dump, or an incinerator, (16) in next door. So the piles
of waste grow, while the places to pile them diminish. This
affects (17) company in two ways. First, (18) get rid of hazardous
waste is (19) become more expensive. This is partly because landfill
(20) cost have soared; and also because companies now face lengthy
paper-chase, filling (21)- forms that record every stage of their
(22) waste progress, from factory gate to (23) the dump. As a
result, more and more companies dispose (24)- their own hazardous
waste; or they (25) (expensive^ change the way they work so as to
reduce (26)- amount they create. Secondly, the difficulty (27)-
getting rid of ordinary household rubbish is driving some
(28) government to impose new obligations (29) to companies, making
them take back their products when (30)- customer wants to be rid
of them. That in turn is (31) change the way companies design
(32) product like computers and cars.

Government 33 - caught between voters, who do not want more dumps
34 or incinerators, and consumers, who want to go on 35 buy
things that will one day be rubbish. Confronted by the incompatible
wishes of each 36 citizens, governments often expect companies to
provide 37 - answers. Sometimes this is sensible, but not 38 alway.
One grand piece of foolishness: most of 39 America federal
environmental spending goes on the pursuit 40 for companies that
once dumped hazardous waste (usually legally), to make 41 it pay for
42 clean up old sites. So far, it is mainly lawyers who have
cleaned up. When the law has not been broken, 43 - cost of clearing
old 44 dump ought to be carried by the taxpayer. As for new waste,
the cost of getting rid of it should 45 rests on 46 - companies
that create it.

47 Other piece of foolishness: the unquestioning
assumption that recycling is 48 - best way to reduce the
mounds of municipal rubbish. This belief starts 49 in a
self-evident truth - that if 50 bottle and tins can lead a second
life, there will be 51 fewer waste. But the argument is then
taken 52 - irrational lengths, with governments setting targets
for 53 - amount of an industry's product that has to be reused.

Recycling is sometimes an efficient solution; 54     it is not.
The materials 55     are easiest to recycle (such as aluminium
cans) are rarely 56     which bulk largest in landfills
(57     newspaper and dir^ries). Recycling schemes, 58     run by
towns or by companies, rarely work 59     subsidy; the only way
to make 60     economically self-supporting is to create a steady
demand for 61     final product. Difficult in theory, impossible
62     practice: markets for raw materials are notoriously
unsteady, 63     that is as true for recycled pulp and plastic
64     for cocoa and chemicals.



65     than subsidising one solution, or bullying companies into
adopting 66    , governments need to tackle the root causes of the
municipal-rubbish mountain. Most goods in short supply become
increasingly expensive, warning people to change their ways. That
is 67     so with rubbish disposal. People pay nothing to throw
away an extra piece of trash. 68    , the old newspapers and bottles
in the rubbish bin magically vanish 69     the dustmen cart them
away. The first goal for policy should be to make polluters carry
the true financial and environmental costs of waste disposal, and
70     leave them to decide the most efficient response.

One good way to induce companies to cut waste is to set in industry
a national target for 71     contribution to the waste stream, and
leave companies to decide how best to meet it. 72     has been the
approach in Holland, 73     industries have accepted a goal of
cutting packaging by 10% 74     the end of the century; and in
France, 75     environment minister has asked industry to come up
with ideas to cut waste sharply by the end of the century. Best of
all would be to allocate companies quotas (76     called "credits")
for the amount of waste 77     contributed to the nation's bins;
and 78     encourage those 79     reduced most cheaply their share
of rubbish to sell off spare 80     to those who found it more
costly to cut back. A variation on 81     idea — suggested by
Project 88, an American public-policy study — would encourage
newspapers to use more recycled fibre 82     setting a national
target, and then allowing papers 83     beat it to sell their spare
"share" to others that failed to meet it.

Lots of countries try to coax people to return bottles by insisting
on a refundable deposit. 84     schemes strike many people as
fair: they tax only 85     who chuck the bottle away. 86     the
size of the deposit, 87     the costs of administering the
scheme, are generally far greater 88
the environmental damage caused by discarded bottles. It would be
89     to save such tactics for those really hazardous items which
people sometimes dump in dustbins and ditches: 90     the lead-acid
battery, 91     the main source of lead in America's environment.
Some American states, including Maine 92     Rhode Island, find
deposits on car batteries encourage people to bring them back — 93
if refundable deposits are set too high, 94     encourage naughty
people to steal batteries.

In most countries the supply of rubbish is growing 95     the supply
of rubbish dumps are shrinking. 96     it is not enough to
reduce the supply of waste; governments 97     need to increase the
supply of sites. 98     way may be to encourage local people to see
these sites 99     a source of income. 100     tough safety rules
are set and policed, cities and states could look for ways to
reward directly 101     who agreed to live near an incinerator or a
waste tip. Getting rid of other people's rubbish has always been a
perfectly respectable way to earn a living. 102     when modern
societies start putting a value on 103     will they realise just
how much it is worth.



